Practical Issues in Neural Network Training: Regularization
Since a larger number of parameters causes overfitting, a natural approach is to constrain the model to use fewer non-zero parameters.
In the previous example, if we constrain the vector W̄ to have only one non-zero component out of five,
the learning algorithm will correctly obtain the solution [2, 0, 0, 0, 0]. Models whose parameters have smaller absolute values also tend to overfit less.
Since it is hard to constrain the values of the parameters directly, the softer approach of adding the penalty λ||W̄||^p to the loss function is used. The value of p is typically set to 2, which leads to Tikhonov regularization.
In general, the squared value of each parameter (multiplied with the regularization parameter λ>0) is added to the objective function.
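As a concrete sketch, the following snippet shows a squared-error objective with the λ-weighted squared penalty added. It assumes a simple linear model ŷ = W̄·X̄; the function and variable names are illustrative, not from the text.

```python
import numpy as np

def regularized_loss(W, X, y, lam):
    """Squared loss plus an L2 (Tikhonov) penalty.

    W: weight vector; X: data matrix (one instance per row);
    y: observed targets; lam: regularization parameter lambda > 0.
    Illustrative sketch -- names are assumptions, not from the text.
    """
    y_hat = X @ W                       # linear predictions
    errors = y - y_hat                  # per-instance error (y - y_hat)
    data_loss = 0.5 * np.sum(errors ** 2)
    penalty = lam * np.sum(W ** 2)      # lambda times squared parameter values
    return data_loss + penalty
```

Note that the penalty grows with the magnitude of every parameter, which is what pushes the optimizer toward solutions with smaller absolute values.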
The practical effect of this change is that a quantity proportional to λWᵢ is subtracted from the update of the parameter Wᵢ. An example of a regularized version of Equation 1.6 for mini-batch S and update step-size α > 0 is as follows:

W̄ ⇐ W̄(1 − αλ) + α Σ_{X̄∈S} E(X̄)X̄    (Equation 1.33)

Here, E(X̄) represents the current error (y − ŷ) between the observed and predicted values of training instance X̄.
One can view this type of penalization as a kind of weight decay during the updates.
Regularization is particularly important when the amount of available data is limited.
A neat biological interpretation of regularization is that it corresponds to gradual forgetting,
as a result of which "less important" (i.e., noisy) patterns are removed.
In general, it is often advisable to use more complex models with regularization rather than simpler models without regularization.
As a side note, the general form of Equation 1.33 is used by many regularized machine learning models, such as least-squares regression, where E(X̄)
is replaced by the error function of that specific model. Interestingly, weight decay is only sparingly used in the single-layer perceptron, because it can sometimes cause overly rapid
forgetting, with a small number of recently misclassified training points dominating the weight vector; the main issue is that the perceptron criterion
is already a degenerate loss function with a minimum value of 0 at W̄ = 0 (unlike its hinge-loss or least-squares cousins).
This quirk is a legacy of the fact that the single-layer perceptron was originally defined in terms of biologically inspired updates rather than in terms of carefully thought-out
loss functions. Convergence to an optimal solution was never guaranteed other than in linearly separable cases.
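To see why the perceptron criterion is degenerate at W̄ = 0, the following sketch evaluates it directly (the function name and data are illustrative; labels y are assumed to be ±1):

```python
import numpy as np

def perceptron_criterion(W, X, y):
    """Perceptron criterion: sum over instances of max(0, -y * (W . x)).

    At W = 0 every margin y * (W . x) is exactly 0, so the loss attains
    its minimum value of 0 -- the degenerate minimum discussed above.
    Weight decay therefore pulls the perceptron toward this trivial solution.
    """
    margins = y * (X @ W)                      # signed margins, one per instance
    return np.sum(np.maximum(0.0, -margins))   # penalize only negative margins
```

By contrast, a hinge loss max(0, 1 − y(W̄·X̄)) is strictly positive at W̄ = 0, so the zero vector is not a minimizer and weight decay does not collapse the model.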
For the single-layer perceptron, some other regularization techniques, which will be discussed in the coming posts, are more commonly used.